Abstract Discrete energy minimization is a ubiquitous task
in computer vision, yet is NP-hard in most cases. In this
work we propose a multiscale framework for coping with the
NP-hardness of discrete optimization. Our approach utilizes
algebraic multiscale principles to efficiently explore the
discrete solution space, yielding improved results on
challenging, non-submodular energies for which current methods
provide unsatisfactory approximations. In contrast to popular
multiscale methods in computer vision, which build an image
pyramid, our framework acts directly on the energy to
construct an energy pyramid. Deriving a multiscale scheme
from the energy itself makes our framework application
independent and widely applicable. Our framework gives rise
to two complementary energy coarsening strategies: one in
which coarser scales involve fewer variables, and a more
revolutionary one in which the coarser scales involve fewer
discrete labels. We empirically evaluated our unified
framework on a variety of both non-submodular and submodular
energies, including energies from the Middlebury benchmark.

Keywords Optimization • Discrete energy minimization • 
Non-submodular • Multiscale • Algebraic multigrid 



1 Introduction 

Figure 1 A unified multiscale framework: We derive a multiscale
representation of the energy itself: an energy pyramid. Our multiscale
framework is unified in the sense that different problems with
different energies share the same multiscale scheme, making our
framework widely applicable and general.

Discrete energy minimization is ubiquitous in computer
vision, and spans a variety of problems. These energies can be
broadly divided into two classes: submodular and
non-submodular energies. Submodular energies are characterized by
"smoothness" encouraging pairwise (or higher order) terms.
Apart from the binary case, minimizing these energies is 
known to be NP-hard. Despite this theoretical hardness, such 
submodular energies, which naturally reflect a "piecewise 
constant" prior, gained popularity and became very common 
in computer vision applications, such as denoising, stereo 
and multi-label segmentation (e.g., Szeliski et al (2008)). 
For this reason most of the efforts of the vision community
regarding discrete optimization focused on developing
approximate optimization methods for these submodular
energies, yielding quite successful algorithms. Recently, more
challenging, non-submodular energies started to gain
popularity. These energies are characterized by a combination
of "smooth" and "non-smooth" encouraging pairwise terms.
The correlation-clustering functional, recently applied to
segmentation, co-segmentation and clustering (e.g., Glasner et al
(2011); Bagon and Galun (2011)), is an example of such a
non-submodular energy. Moreover, non-submodular energies
may appear when the parameters of the energy are
automatically learned (e.g., Nowozin et al (2011)). Since such
non-submodular energies have only recently been explored, their
optimization has received less attention, and consequently the
existing optimization methods provide approximations that
may be quite unsatisfactory. In practice, optimizing
non-submodular energies is generally considered the more
challenging task.



But what makes discrete energy minimization such a
challenging endeavor? The fact that this minimization implies
an exploration of an exponentially large search space. One way
to alleviate this difficulty is to use multiscale search. The
illustration on the right shows a toy "energy" E(L) at
different scales of detail. Considering only the original scale
(s = 0), it is very difficult to suggest an effective exploration
(optimization) method. However, when looking at coarser
scales (s = 1, ..., 3) of the energy an interesting phenomenon
is revealed. At the coarsest scale (s = 3) the large basins
of attraction emerge, but with very low accuracy. As the
scales become finer (s = 2, ..., 0), one "loses sight" of
the large basins, but may now "sense" more local properties
with higher accuracy. We term this well known phenomenon
the multiscale landscape of the energy. This multiscale
landscape phenomenon encourages coarse-to-fine exploration
strategies: starting with the large basins that are apparent at
coarse scales, and then gradually and locally refining the
search at finer scales.

For more than three decades the vision community has
focused on the multiscale pyramid of images (e.g., Lucas and
Kanade (1981); Burt and Adelson (1983)). There is almost
no experience, and there are hardly any methods, that apply a
multiscale scheme directly to discrete energies.

Another domain in which multiscale methods are com- 
mon practice is numerical PDE solvers. Early works in that 
domain applied geometric coarsening (geometric multigrid), 
which is the analogue of the classical image pyramid. A so- 
lution for a PDE was then obtained by applying a single- 
scale solver at each scale (relaxation). This geometric multi- 
grid paradigm suggested a very simple construction of a reg- 
ular pyramid at the cost of very careful design of single- 
scale solvers, tailoring them for each problem separately. 
A breakthrough for the PDE community was the development
of algebraic multigrid (AMG) of Brandt (1986). The
algebraic multigrid approach derives the pyramid
directly from the underlying problem, resulting in an
irregular data-driven pyramid. This way, local and general solvers
(e.g., Gauss-Seidel relaxation) can be incorporated into the
algebraic pyramid yielding improved and robust solutions
(Stüben (1999)).

In this paper we present a novel unified discrete multi- 
scale optimization scheme that acts directly on the energy 
(Fig. 1). Our multiscale framework is unified in the sense 
that it is application independent: different problems with 
different energies share the same multiscale scheme, mak- 
ing our framework widely applicable and general. More im- 
portantly, our multiscale method efficiently explores the dis- 
crete solution space through an irregular multiscale energy 
pyramid, constructed by energy-aware coarse-to-fine inter- 
polation. In a sense, our method may be considered as the 
discrete analogue of AMG: Instead of focusing attention on 
complicated optimization schemes, our framework exposes 
the multiscale landscape of the energy through energy-aware 
construction of the pyramid. This way even simple and lo- 
cal optimization methods can be incorporated into our pyra- 
mid yielding improved and robust approximations. In prac- 
tice, we apply our multiscale optimization method to a large 
set of challenging problems, including submodular and non- 
submodular, and achieve comparable or lower energy val- 
ues, than those obtained by the state-of-the-art methods. 

This work makes several contributions: 

(i) A novel unified multiscale framework for discrete op- 
timization: A wide variety of optimization problems, 
including segmentation, stereo, denoising, correlation- 
clustering, and others share the same multiscale frame- 
work. 

(ii) Any multiscale scheme requires a single-scale opti- 
mization method to refine the search at each scale. Our 
framework is also unified in the sense that it is not re- 
stricted to any specific optimization method. 

(iii) Energy-aware coarsening scheme. Variable aggrega- 
tion takes into account the underlying structure of the 
energy itself, thus efficiently and directly exposes its 
multiscale landscape. 

(iv) A discrete analogue to AMG. Incorporating even
simple and local optimization methods into our
energy-aware pyramid yields good approximations.

(v) Coarsening the labels. Our formulation allows for vari- 
able coarsening as well as for label coarsening. 

(vi) Optimizing hard non-submodular energies. We achieve 
significantly lower energy assignments on diverse com- 
puter vision energies, including challenging non-sub- 
modular examples. 
1.1 Related work

Algorithms for discrete energy minimization can work in
the primal space or the dual space. Primal methods act on
the discrete variables in the label space to minimize the
energy (e.g., Besag (1986); Boykov et al (2002); Rother et al
(2007)). Dual methods formulate a dual problem to the
energy and maximize a lower bound to the sought energy (e.g.,
Kolmogorov (2006)). Dual methods have recently been considered
more favorable since they provide not only an approximate
solution, but also a lower bound on how far
this solution is from the global optimum. Furthermore, if a
labeling is found with energy equal to the lower bound, a
certificate is provided that the global optimum was found.
For submodular energies it was shown (by Szeliski et al
(2008)) that dual methods tend to provide better
approximations with very tight lower bounds. However, using several
classes of non-submodular energies, we empirically
demonstrate that when it comes to challenging non-submodular
energies, primal methods tend to provide better
approximations than dual methods, since in these cases the lower bound
is no longer tight (Werner (2010)).

Our multiscale framework constructs a multiscale energy
pyramid in terms of the primal space. Compared with the
state-of-the-art methods (primal and dual), we achieve
comparable performance on submodular problems and superior
performance on non-submodular problems.

There are very few works that apply multiscale schemes
directly to the discrete energy. A prominent example of
this approach was suggested by Felzenszwalb and Huttenlocher
(2006); it provides a coarse-to-fine belief propagation
scheme restricted to a regular dyadic pyramid. A more recent
work is that of Komodakis (2010), which provides an algebraic
multigrid formulation for discrete optimization in the dual
space. However, despite his general formulation Komodakis
only provides examples using regular dyadic grids of
submodular energies.

The work of Kim et al (2011) proposes a two-scale scheme
mainly aimed at improving run-time of the optimization pro- 
cess. Their proposed coarsening strategies can be interpreted 
as special cases of our unified framework. We analyze their 
underlying assumptions (Sec. 3.1), and suggest better meth- 
ods for efficient exploration of the multiscale landscape of 
the energy. 

The complexity of the optimization algorithms is affected
by the number of discrete labels, as well as the number of
variables. Existing optimization algorithms start to fall
behind when facing energies with a large label space. Lempitsky
et al (2007) proposed a method to exploit known properties
of the metric between the labels to allow for faster
minimization of energies with a large number of labels. However,
their method is restricted to energies with clear and known
label metrics and requires training. In contrast, our
framework addresses this issue via a principled scheme that builds
an energy pyramid with a decreasing number of labels
without prior training and with fewer assumptions on the label
interactions.

2 Multiscale Energy Pyramid 

We consider discrete pair-wise minimization problems,
defined over a (weighted) graph $(\mathcal{V}, \mathcal{E})$, of the form:

$$E(L) = \sum_{i \in \mathcal{V}} \varphi_i(l_i) + \sum_{(i,j) \in \mathcal{E}} w_{ij} \, \varphi(l_i, l_j) \qquad (1)$$

where $\mathcal{V}$ is the set of variables, $\mathcal{E}$ is the set of edges, and the
solution is discrete: $L \in \{1, \ldots, l\}^n$, with $n$ variables
taking $l$ possible labels. Many problems in computer vision are
cast in the form of (1) (see Szeliski et al (2008)).
Furthermore, we do not restrict the energy to be submodular, and
our framework is also applicable to more challenging
non-submodular energies.

Our aim is to build an effective energy pyramid with a 
decreasing number of degrees of freedom. The key com- 
ponent in constructing such a pyramid is the interpolation 
method. The interpolation maps solutions between levels of 
the pyramid, and determines the original energy approxi- 
mation with fewer degrees of freedom. We propose a novel 
principled energy aware interpolation method such that the 
resulting energy pyramid efficiently exposes the multiscale 
landscape of the energy making low energy assignments ap- 
parent at coarse levels. 

Practically, it is counterintuitive to directly interpolate
discrete label values, since they usually have only semantic
interpretation. Therefore, we substitute an assignment $L$ by
an equivalent binary matrix representation $U \in \{0,1\}^{n \times l}$.
The rows of $U$ correspond to the variables, and the columns
correspond to labels: $U_{i,\alpha} = 1$ iff variable $i$ is labeled "$\alpha$"
($l_i = \alpha$). This representation allows us to interpolate
discrete solutions, as will be shown in the subsequent sections.

Expressing the energy (1) using $U$ yields a relaxed quadratic
representation (Rangarajan (2000)). This algebraic
representation forms the basis for our principled multiscale
framework derivation:

$$E(U) = \mathrm{Tr}\left(D U^T + W U V U^T\right) \qquad (2)$$

$$\text{s.t.} \quad U \in \{0,1\}^{n \times l}, \quad \sum_{\alpha=1}^{l} U_{i\alpha} = 1 \qquad (3)$$

where $W = \{w_{ij}\}$, $D \in \mathbb{R}^{n \times l}$ s.t. $D_{i,\alpha} = \varphi_i(\alpha)$, and
$V \in \mathbb{R}^{l \times l}$ s.t. $V_{\alpha,\beta} = \varphi(\alpha, \beta)$, $\alpha, \beta \in \{1, \ldots, l\}$.
An energy over $n$ variables with $l$ labels is now
parameterized by $(n, l, D, W, V)$.
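
As a sanity check on this representation, the following minimal Python sketch (not part of the original work) builds the binary matrix U from a labeling and verifies numerically that the trace form (2) reproduces the sum form (1). It assumes 0-based labels, a symmetric V, and that each edge weight is stored once in W (for example in its upper triangle), so that the trace does not count an edge twice. All function names are illustrative.

```python
import numpy as np

def labels_to_indicator(labels, num_labels):
    """Return the n-by-l binary matrix U with U[i, a] = 1 iff labels[i] == a."""
    U = np.zeros((len(labels), num_labels))
    U[np.arange(len(labels)), labels] = 1.0
    return U

def energy_from_labels(labels, D, W, V):
    """Sum form (1): unary terms plus w_ij * V[l_i, l_j] over the stored edges."""
    unary = D[np.arange(len(labels)), labels].sum()
    rows, cols = np.nonzero(W)
    pairwise = (W[rows, cols] * V[labels[rows], labels[cols]]).sum()
    return unary + pairwise

def energy_from_indicator(U, D, W, V):
    """Trace form (2): Tr(D U^T + W U V U^T)."""
    return np.trace(D @ U.T + W @ U @ V @ U.T)
```

If W is instead stored symmetrically, with both w_ij and w_ji present, the pairwise part of the trace is twice the sum over undirected edges, so the weights would need to be halved to match (1).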

We first describe the energy pyramid construction for a 
general interpolation matrix P, and defer the detailed de- 
scription of our novel interpolation to Sec. 3. 

Energy coarsening by variables 

Let $(n^f, l, D^f, W^f, V)$ be the fine scale energy. We wish
to generate a coarser representation $(n^c, l, D^c, W^c, V)$ with
$n^c < n^f$. This representation approximates $E(U^f)$ using
fewer variables: $U^c$ with only $n^c$ rows.

An interpolation matrix $P \in [0,1]^{n^f \times n^c}$ s.t. $\sum_j P_{ij} = 1 \; \forall i$,
maps a coarse assignment $U^c$ to a fine assignment $P U^c$. Consider
any fine assignment that can be approximated by a coarse
assignment $U^c$, i.e.,

$$U^f \approx P U^c \qquad (4)$$

Plugging (4) into (2):

$$\begin{aligned}
E(U^f) &= \mathrm{Tr}\left(D^f U^{fT} + W^f U^f V U^{fT}\right) \\
&\approx \mathrm{Tr}\left(D^f U^{cT} P^T + W^f P U^c V U^{cT} P^T\right) \\
&= \mathrm{Tr}\left(\left(P^T D^f\right) U^{cT} + \left(P^T W^f P\right) U^c V U^{cT}\right) \\
&= \mathrm{Tr}\left(D^c U^{cT} + W^c U^c V U^{cT}\right) \\
&= E(U^c)
\end{aligned} \qquad (5)$$

with $D^c = P^T D^f$ and $W^c = P^T W^f P$.
We have generated a coarse energy $E(U^c)$ parameterized by
$(n^c, l, D^c, W^c, V)$ that approximates the fine energy $E(U^f)$.
This coarse energy is of the same form as the original energy 
allowing us to apply the coarsening procedure recursively to 
construct an energy pyramid. 
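
In computational terms, the coarsening in (5) amounts to two matrix products. The Python sketch below (illustrative, with hypothetical function names) computes the coarse unary and pairwise-weight matrices from a given interpolation matrix P. Because (5) becomes an algebraic identity once the fine assignment is replaced by P times the coarse assignment, the coarse energy of any coarse assignment equals the fine energy of its interpolation, which gives a convenient numerical check.

```python
import numpy as np

def energy(U, D, W, V):
    """Quadratic energy (2): Tr(D U^T + W U V U^T)."""
    return np.trace(D @ U.T + W @ U @ V @ U.T)

def coarsen_variables(P, D_f, W_f):
    """Variable coarsening (5): D_c = P^T D_f and W_c = P^T W_f P; V is unchanged."""
    D_c = P.T @ D_f
    W_c = P.T @ W_f @ P
    return D_c, W_c
```

Note that W_c may carry non-zero diagonal entries: weights of fine edges whose endpoints are aggregated into the same coarse variable end up on the diagonal.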

Energy coarsening by labels 

So far we have explored the reduction of the number of
degrees of freedom by reducing the number of variables.
However, we may just as well look at the problem from a
different perspective: reducing the search space by decreasing
the number of labels from $l^f$ to $l^c$ ($l^c < l^f$). It is a well
known fact that optimization algorithms suffer from significant
degradation in performance as the number of labels
increases (Bleyer et al (2010)). Here we propose a novel
principled and general framework for reducing the number of
labels at each scale.

Figure 2 Interpolation as soft variable aggregation: fine
variables 1, 2, 3 and 4 are softly aggregated into coarse variables 1
and 2. For example, fine variable 1 is a convex combination
of .7 of 1 and .3 of 2. Hard aggregation is a special case where P is a
binary matrix. In that case each fine variable is influenced by exactly
one coarse variable.

Let $(n, l^f, D^f, W, V^f)$ be the fine scale energy. Looking
at a different interpolation matrix $\hat{P} \in [0,1]^{l^f \times l^c}$, we
interpolate a coarse solution by $U^f \leftarrow U^{\hat{c}} \hat{P}^T$. This time the
interpolation matrix $\hat{P}$ acts on the labels, i.e., the columns
of $U$. The coarse labeling matrix $U^{\hat{c}}$ has the same number
of rows (variables), but fewer columns (labels). We use the hat
notation to emphasize that the coarsening here affects the
labels rather than the variables.
Coarsening the labels yields:

$$E(U^{\hat{c}}) = \mathrm{Tr}\left(\left(D^f \hat{P}\right) U^{\hat{c}T} + W U^{\hat{c}} \left(\hat{P}^T V^f \hat{P}\right) U^{\hat{c}T}\right) \qquad (6)$$

Again, we end up with the same type of energy, but this
time it is defined over a smaller number of discrete labels:
$(n, l^c, D^{\hat{c}}, W, V^{\hat{c}})$, where $D^{\hat{c}} = D^f \hat{P}$ and $V^{\hat{c}} = \hat{P}^T V^f \hat{P}$.
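
Label coarsening (6) can be sketched in the same way as variable coarsening, with the interpolation now acting on the columns of the labeling matrix. The Python fragment below is an illustrative sketch with hypothetical names; it computes the coarse unary matrix and the coarse label-interaction matrix, leaving W untouched.

```python
import numpy as np

def energy(U, D, W, V):
    """Quadratic energy (2): Tr(D U^T + W U V U^T)."""
    return np.trace(D @ U.T + W @ U @ V @ U.T)

def coarsen_labels(P_hat, D_f, V_f):
    """Label coarsening (6): D_c = D_f P_hat and V_c = P_hat^T V_f P_hat; W is unchanged."""
    D_c = D_f @ P_hat
    V_c = P_hat.T @ V_f @ P_hat
    return D_c, V_c
```

Evaluating the fine energy at the interpolated labeling (the coarse labeling matrix times the transpose of P_hat) and the coarse energy at the coarse labeling should give the same value, since (6) follows from (2) by substitution.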

The main theoretical contribution of this work is
encapsulated in the multiscale "trick" of equations (5) and (6).
Formulating the interpolation as a linear operator ($P$) and
plugging it into the quadratic energy representation (2)
provides a principled algebraic representation for our
multiscale framework. Our direct formulation is in contrast to
the "ad-hoc" representation of Felzenszwalb and Huttenlocher
(2006); Kim et al (2011), and Komodakis (2010). Our
scheme moves the multiscale completely to the optimization
side and makes it independent of any specific application.
We can now practically approach a wide and diverse family
of energies using the same multiscale implementation.

The effectiveness of the multiscale approximation of (5)
and (6) heavily depends on the interpolation matrix $P$ ($\hat{P}$
resp.). Poorly constructed interpolation matrices will fail to
expose the multiscale landscape of the functional. In the
subsequent section we describe our principled energy-aware
method for computing it.



3 Energy-aware Interpolation 

In this section we use terms and notations for variable
coarsening ($P$); however, the motivation and methods are
applicable to label coarsening ($\hat{P}$) as well, due to the similar
algebraic structure of (5) and (6).

Our energy pyramid approximates the original energy
using a decreasing number of degrees of freedom, thus
excluding some solutions from the original search space at
coarser scales. Which solutions are excluded is determined
by the interpolation matrix $P$. A desired interpolation does
not exclude low energy assignments at coarse levels.

The matrix $P$ can be interpreted as an operator that
aggregates fine-scale variables into coarse ones (Fig. 2).
Aggregating fine variables $i$ and $j$ into a coarser one excludes
from the search space all assignments for which $l_i \ne l_j$.
This aggregation is undesired if assigning $i$ and $j$ to
different labels yields low energy. However, when variables $i$ and
$j$ are in agreement under the energy (i.e., assignments with
$l_i = l_j$ yield low energy), aggregating them together allows
for efficient exploration of low energy assignments. A
desired interpolation aggregates $i$ and $j$ when $i$ and $j$ are
in agreement under the energy.

3.1 Measuring energy-aware agreements 

We provide two measures for agreement: one is used for
computing variable coarsening ($P$), while the other is used
for label coarsening ($\hat{P}$).

Energy-aware agreement between variables: A reliable
estimation for the agreement between the variables allows
us to construct a desirable $P$ that aggregates variables that
are in agreement under the energy. A naive approach would
assume that neighboring variables are always in agreement
(this assumption underlies the dyadic pyramids of
Felzenszwalb and Huttenlocher (2006); Komodakis (2010)). This
assumption clearly does not hold in general and may yield
an undesired interpolation matrix $P$ leading to an
inefficient multiscale scheme. More recently Kim et al (2011)
suggested using the energy itself in order to estimate
variable agreements. However, their ad-hoc methods are
incapable of balancing the effect of the unary and pair-wise terms
of the energy.

Indeed it is difficult to decide which term dominates and
how to fuse these two terms together. Therefore, we propose
a novel empirical scheme for agreement estimation that
naturally accounts for and integrates the influence of both the
unary and the pair-wise terms. Moreover, our method
applies to all energies of the form (2): submodular, non-submodular,
metric $V$, arbitrary $V$, arbitrary $W$, energies defined over
regular grids and arbitrary graphs.

Variables $i$ and $j$ are in agreement under the energy when
$l_i = l_j$ yields a relatively low energy value. To estimate these
agreements we empirically generate several samples with
relatively low energy, and measure the label agreement
between neighboring variables $i$ and $j$ in these samples. We
use Iterated Conditional Modes (ICM) of Besag (1986) to
obtain locally low energy assignments: Starting with a
random assignment, ICM chooses, at each iteration, for each
variable, the label yielding the largest decrease of the energy
function, conditioned on the labels assigned to its neighbors.

This procedure may be viewed as a special case of
sampling from a distribution: The assumed underlying
distribution is a Gibbs distribution, i.e., $p(U) \propto \exp\left(-\frac{1}{T} E(U)\right)$.
ICM may be interpreted as Gibbs sampling from the
distribution at the limit $T \to 0$ (i.e., the "zero-temperature" limit).
Therefore, our samples may be viewed as zero-temperature
Gibbs sampling with multiple restarts from the posterior (Koller
and Friedman (2009)).

Performing $t = 10$ ICM iterations with $K = 10$ random
restarts provides us with $K$ samples $\{L^k\}_{k=1}^{K}$.
Utilizing the label-disagreement weights encoded in the matrix
$V$, the disagreement between neighboring variables $i$ and $j$
is estimated as $d_{ij} = \sum_k V_{l_i^k, l_j^k}$, where $l_i^k$ is the label of
variable $i$ in the $k$-th sample. Their agreement is then given
by $c_{ij} = \exp\left(-d_{ij} / \sigma\right)$, with $\sigma \propto \max V$.
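
The empirical estimation of variable agreements can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: W is taken to be a dense symmetric matrix with zero diagonal, V is symmetric with a positive maximum, labels are 0-based, ICM visits variables in index order, and the proportionality constant in the choice of sigma is exposed as a hypothetical parameter `sigma_scale`.

```python
import numpy as np

def icm_sample(D, W, V, rng, num_iters=10):
    """Run ICM sweeps from a random labeling and return the resulting labels."""
    n, num_labels = D.shape
    labels = rng.integers(0, num_labels, size=n)
    for _ in range(num_iters):
        for i in range(n):
            nbrs = np.nonzero(W[i])[0]
            # Pairwise cost of every candidate label of i given its neighbors' labels.
            pair = (W[i, nbrs][:, None] * V[labels[nbrs]]).sum(axis=0)
            labels[i] = np.argmin(D[i] + pair)
    return labels

def variable_agreements(D, W, V, num_samples=10, num_iters=10,
                        sigma_scale=1.0, seed=0):
    """Estimate c_ij = exp(-d_ij / sigma) on the edges of W from ICM samples."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(W)
    d = np.zeros(len(rows))
    for _ in range(num_samples):
        labels = icm_sample(D, W, V, rng, num_iters)
        d += V[labels[rows], labels[cols]]  # accumulate label disagreement per edge
    sigma = sigma_scale * V.max()  # sigma proportional to max V (constant assumed)
    C = np.zeros(W.shape)
    C[rows, cols] = np.exp(-d / sigma)
    return C
```

The returned matrix is zero for pairs that are not neighbors, so only agreements along edges of the graph are estimated.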
Energy-aware agreement between labels: Agreements
between labels are easier to estimate, since this information is
explicit in the matrix $V$ that encodes the label-disagreement
between any two labels. Setting $\hat{c}_{\alpha,\beta} \propto \left(V_{\alpha,\beta}\right)^{-1}$, we get
a "closed-form" expression for the agreements between
labels.

3.2 From agreements to interpolation 

Using our measure for the variable agreements, $c_{ij}$, we
follow the Algebraic Multigrid (AMG) method of Brandt (1986)
to first determine the set of coarse representatives and then
construct an interpolation matrix $P$ that softly aggregates
variables according to their agreement.

We begin by selecting a set of coarse representative
variables $\mathcal{V}^c \subset \mathcal{V}^f$, such that every variable in $\mathcal{V}^f \setminus \mathcal{V}^c$ is in
agreement with $\mathcal{V}^c$. A variable $i$ is considered in agreement
with $\mathcal{V}^c$ if $\sum_{j \in \mathcal{V}^c} c_{ij} \ge \beta \sum_{j \in \mathcal{V}^f} c_{ij}$. That is, every
variable in $\mathcal{V}^f$ is either in $\mathcal{V}^c$ or is in agreement with other
variables in $\mathcal{V}^c$, and thus well represented in the coarse scale.

We perform this selection greedily and sequentially,
starting with $\mathcal{V}^c = \emptyset$ and adding $i$ to $\mathcal{V}^c$ if it is not yet in agreement
with $\mathcal{V}^c$. The parameter $\beta$ affects the coarsening rate, i.e.,
the ratio $n^c / n^f$: smaller $\beta$ results in a lower ratio.

At the end of this process we have a set of coarse
representatives $\mathcal{V}^c$. The interpolation matrix $P$ is then defined
by:

$$P_{i I(j)} = \begin{cases} c_{ij} & i \in \mathcal{V}^f \setminus \mathcal{V}^c, \; j \in \mathcal{V}^c \\ 1 & i \in \mathcal{V}^c, \; j = i \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

where $I(j)$ is the coarse index of the variable whose fine
index is $j$ (in Fig. 2: $I(2) = 1$ and $I(3) = 2$).
Algorithm 1: Discrete multiscale optimization.

Input: Energy $(n^0, l, D^0, W^0, V)$.
Output: $U^0$

Init $s \leftarrow 0$ // fine scale
// Energy pyramid construction:
while $|\mathcal{V}^s| > 10$ do
    Estimate pair-wise agreements at scale $s$ (Sec. 3.1).
    Compute interpolation matrix $P^s$ (Sec. 3.2).
    Derive coarse energy $(n^{s+1}, l, D^{s+1}, W^{s+1}, V)$ (Eq. 5).
    $s$++
// Coarse-to-fine optimization:
while $s \ge 0$ do
    $U^s \leftarrow$ Refine$(\tilde{U}^s)$
    $\tilde{U}^{s-1} = P^s U^s$ // interpolate a solution
    $s$--

where Refine$(\tilde{U}^s)$ uses an existing single-scale method to
optimize the energy $(n^s, l, D^s, W^s, V)$ with $\tilde{U}^s$ as an
initialization.



We further prune rows of $P$ leaving only the $\delta$ maximal
entries. Each row is then normalized to sum to 1. Throughout
our experiments we use $\beta = 0.2$ ($0.75$) and $\delta = 3$ ($2$)
for computing $P$ ($\hat{P}$ resp.).
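
The construction of the interpolation matrix from the agreements, including the greedy selection of coarse representatives, the definition (7), and the pruning and normalization of rows, can be sketched as below. This is an illustrative Python sketch under assumptions not fixed by the text: the agreement matrix C is dense, symmetric and non-negative with zero diagonal, variables are visited in index order, the threshold is positive, and a variable with no agreements at all is made a coarse representative so that no row of P is empty. Function names are hypothetical.

```python
import numpy as np

def select_coarse(C, beta=0.2):
    """Greedily pick coarse representatives; returns their fine indices."""
    n = C.shape[0]
    is_coarse = np.zeros(n, dtype=bool)
    total = C.sum(axis=1)
    for i in range(n):
        # i joins the coarse set unless it already agrees with it.
        if total[i] == 0 or C[i, is_coarse].sum() < beta * total[i]:
            is_coarse[i] = True
    return np.nonzero(is_coarse)[0]

def interpolation_matrix(C, coarse, delta=3):
    """Build P as in (7), keep the delta largest entries per row, normalize rows."""
    P = C[:, coarse].astype(float)              # fine rows: P[i, I(j)] = c_ij
    P[coarse, :] = 0.0
    P[coarse, np.arange(len(coarse))] = 1.0     # coarse rows: identity
    if P.shape[1] > delta:
        drop = np.argsort(P, axis=1)[:, :-delta]  # indices of all but the delta largest
        np.put_along_axis(P, drop, 0.0, axis=1)
    return P / P.sum(axis=1, keepdims=True)
```

For example, on a chain of four variables with agreements 0.9, 0.1 and 0.8 along its edges, the selection with a threshold of 0.2 keeps variables 0 and 2 as coarse representatives, and variable 1 is interpolated from both with weights 0.9 and 0.1.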



4 A Unified Discrete Multiscale Framework 

So far we have described the different components of our
multiscale framework. Alg. 1 puts them together into a
multiscale minimization scheme. Given an energy $(n, l, D, W, V)$,
our framework first works fine-to-coarse to compute
interpolation matrices $\{P^s\}$ that construct the "energy pyramid":
$\{(n^s, l, D^s, W^s, V)\}_{s=0,\ldots,S}$. Typically we end up at the
coarsest scale with fewer than 10 variables. As a result, exploring
the energy at this scale is robust to the initial assignment of
the single-scale method used¹.

Starting from the coarsest scale, we apply a simple
single-scale optimization method (e.g., ICM, α-expansion, etc.).
Since there are very few degrees of freedom at the coarsest 
scale, these single-scale methods are likely to obtain a low- 
energy coarse solution. This stems from the fact that at the 
coarsest scale the large basins of attraction of the energy are 
easily accessed and explored. 

At each scale $s$, the coarse solution $U^s$ is interpolated to
a finer scale $s - 1$: $\tilde{U}^{s-1} \leftarrow P^s U^s$. At the finer scale $\tilde{U}^{s-1}$
serves as a good initialization for optimizing the energy with 
the same single-scale optimization method. These two steps 
of interpolation followed by refinement are repeated for all 
scales from coarse to fine. 

1 In practice, at the coarsest scale we use "winner-take-all" initial- 
ization as suggested by (Szeliski et al 2008, §3.1). 



Single-scale optimization methods for discrete energies
generally accept only discrete assignments (i.e., satisfying the binary
constraints (3)) as an initialization. However, the
interpolated solution $\tilde{U}^{s-1}$, at each scale, might not satisfy the
binary constraints (3). Therefore, we round each row of $\tilde{U}^{s-1}$
by setting the maximal element to 1 and the rest to 0.
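
The interpolation and rounding step can be written in a few lines. The sketch below is illustrative Python with a hypothetical function name; it assumes ties within a row are broken in favor of the lowest label index, a detail the text does not specify.

```python
import numpy as np

def interpolate_and_round(P, U_coarse):
    """Interpolate a coarse labeling to the finer scale and round each row to one-hot."""
    U_soft = P @ U_coarse                       # rows may be fractional
    U_fine = np.zeros_like(U_soft, dtype=float)
    winners = U_soft.argmax(axis=1)             # maximal element of each row
    U_fine[np.arange(U_soft.shape[0]), winners] = 1.0
    return U_fine
```

The rounded matrix satisfies the binary constraints (3) and can be handed to any single-scale method as an initialization.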

The most computationally intensive module of our
framework is the empirical estimation of the variable agreements.
The complexity of the agreement estimation is $O(|\mathcal{E}| \cdot l)$,
where $|\mathcal{E}|$ is the number of non-zero elements in $W$ and $l$ is
the number of labels. However, it is fairly straightforward to
parallelize this module.

It is now easy to see how our framework generalizes
Felzenszwalb and Huttenlocher (2006), Komodakis (2010)
and Kim et al (2011). They are restricted to hard
aggregation in $P$. Felzenszwalb and Huttenlocher (2006) and
Komodakis (2010) use a multiscale pyramid; however, their
variable aggregation is not energy-aware, and is restricted to
dyadic pyramids. On the other hand, Kim et al (2011) have
limited energy-aware aggregation, applied to a two-level
"pyramid" only.



5 Experimental Results 

We evaluated our multiscale framework on a diversity of
discrete optimization tasks², ranging from challenging
non-submodular synthetic and co-clustering energies, to low-level
submodular vision energies such as denoising and stereo. In
all of these experiments we minimize a given publicly
available benchmark energy; we do not attempt to improve on the
energy formulation itself.

For every instance of an energy minimization problem in
these benchmarks we construct an energy pyramid using our
method. We then use our energy pyramid to efficiently
exploit the multiscale landscape of each energy to improve
optimization results of existing methods. In the following
experiments we use ICM (Besag (1986)), αβ-swap and
α-expansion (large move making algorithms of Boykov et al
(2002)) as representative single-scale primal optimization
algorithms. Each step of the large move making algorithms
of Boykov et al (2002) solves a reduced binary problem. For
the challenging non-submodular energies these binary steps
are approximated using QPBO(I) of Rother et al (2007).

We follow the protocol of Szeliski et al (2008) that uses 
the lower bound of TRW-S (Kolmogorov (2006)) as a base- 
line for comparing performance of different optimization 
methods on different energies. We report the ratio between 

2 code available at www.wisdom.weizmann.ac.il/~bagon/matlab.html.
Table 1 Synthetic results: Showing percent of achieved energy value
relative to the lower bound (closer to 100% is better) for ICM,
αβ-swap, α-expansion and TRW-S for varying strengths of the pair-wise
term (λ = 5, 10, 15; stronger → harder to optimize.)

        ICM                  Swap (QPBO)          Expand (QPBO)
λ       Ours     single      Ours     single      Ours     single      TRW-S
                 scale                scale                scale
5       112.6%   115.9%      108.9%   110.0%      110.5%   110.0%      116.6%
10      123.6%   130.2%      118.5%   120.2%      121.5%   121.0%      134.6%
15      127.1%   135.8%      122.1%   124.1%      124.6%   125.1%      138.3%



the resulting energy value and the lower bound (in percent);
closer to 100% is better.

These experiments show how our energy-aware
construction of the pyramid efficiently exposes the underlying
multiscale landscape of the energy. This way even a simple and
very local optimization scheme (applied at each scale) can
achieve good approximations. The most prominent
example is ICM (Besag (1986)): this greedy local coordinate
descent algorithm performs poorly when applied directly to
the energy. It converges very rapidly to a sub-optimal
local solution (see, e.g., Szeliski et al (2008)). However, when
used within our multiscale framework, local search at coarse
scales amounts to very large and non-local search in the
fine scale. This example stresses the advantage of
constructing an energy-aware multiscale framework: Exposing the
multiscale landscape of the energy helps to achieve good
approximation even when using simple and local methods at
each scale.

When incorporating large move making algorithms as
the single-scale optimization in our framework, there is a
consistent improvement of multiscale over the single-scale
schemes. In addition, TRW-S is a dual method and is
considered state-of-the-art for discrete energy minimization (Szeliski
et al (2008)). However, we show that when it comes to
non-submodular energies it lags behind the large move
making algorithms and even ICM. Moreover, for these
challenging energies, our multiscale framework gives a significant
boost in optimization performance, achieving significantly
lower energy values than TRW-S.



5.1 Synthetic 

Figure 3 Chinese characters inpainting: Visualizing some of the
instances used in our experiments. Columns are (left to right): The
original character used for testing. The input, partially occluded character.
ICM and QPBO results, both our multiscale and single scale results.
Results of TRW-S and results of Nowozin et al (2011) obtained with a
very long run of simulated annealing (using Gibbs sampling inside the
annealing).

We begin with synthetic non-submodular energies defined
over a 4-connected grid graph of size 50 × 50 ($n = 2500$),
and $l = 5$ labels. The unary term $D \sim \mathcal{N}(0, 1)$. The
pair-wise term $V_{\alpha\beta} = V_{\beta\alpha} \sim \mathcal{U}(0, 1)$ ($V_{\alpha\alpha} = 0$) and $w_{ij} =
w_{ji} \sim \lambda \cdot \mathcal{U}(-1, 1)$. The parameter $\lambda$ controls the relative
strength of the pair-wise term: stronger (i.e., larger $\lambda$) results
in energies more difficult to optimize (see Kolmogorov
(2006)). Table 1 shows results, averaged over 100
experiments.
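
A synthetic energy of this kind can be generated with a short script. The Python sketch below is illustrative only: the function name and the seeding are assumptions, and W is stored as a dense symmetric matrix for simplicity, which for the full 50 × 50 grid takes roughly 50 MB (a sparse matrix would be the natural choice in practice).

```python
import numpy as np

def synthetic_energy(height=50, width=50, num_labels=5, lam=5.0, seed=0):
    """Random non-submodular energy on a 4-connected grid; returns (D, W, V)."""
    rng = np.random.default_rng(seed)
    n = height * width
    D = rng.standard_normal((n, num_labels))                 # unary term ~ N(0, 1)
    V = np.triu(rng.uniform(0.0, 1.0, (num_labels, num_labels)), k=1)
    V = V + V.T                                              # symmetric, zero diagonal
    idx = np.arange(n).reshape(height, width)
    edges = np.concatenate([
        np.stack([idx[:, :-1].ravel(), idx[:, 1:].ravel()], axis=1),  # horizontal
        np.stack([idx[:-1, :].ravel(), idx[1:, :].ravel()], axis=1),  # vertical
    ])
    w = lam * rng.uniform(-1.0, 1.0, len(edges))             # w_ij ~ lam * U(-1, 1)
    W = np.zeros((n, n))
    W[edges[:, 0], edges[:, 1]] = w
    W[edges[:, 1], edges[:, 0]] = w                          # w_ij = w_ji
    return D, W, V
```

Because the edge weights are drawn symmetrically around zero, roughly half of them are negative, which is what makes the resulting energies non-submodular.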

Using our multiscale framework to perform coarse-to-fine
optimization of the energy yields significantly lower
energies for all single-scale methods used (ICM, α-expansion
and αβ-swap) and TRW-S: The percents in the "ours" column
are closer to 100% than the results of the other methods.

Despite the fact that these synthetic energies were
randomly generated without any underlying structure, there is
still a multiscale landscape to the functional. Our multiscale
framework constructs an energy pyramid that exposes this
underlying multiscale landscape, resulting in better and
more efficient optimization results.

The resulting synthetic energies are non-submodular (since
$w_{ij}$ may become negative). For these challenging energies, the
state-of-the-art dual method (TRW-S) performs rather poorly 3
(worse than single-scale ICM), and there is a significant gap
between the lower bound and the energy of the actual primal
solution provided. This gap might be due to the fact that for
these challenging non-submodular energies the dual bound is
not tight (Werner (2010)).
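
To see why negative weights break submodularity, one can restrict a
pair-wise term to any two labels and apply the standard binary
submodularity condition. The derivation below is a short sketch that
assumes the pair-wise term has the form w_ij V(l_i, l_j); the symbol
phi_ij is introduced here only for illustration.

```latex
\begin{align*}
  % Pair-wise term for neighbors i, j, restricted to labels \alpha \neq \beta
  \phi_{ij}(l_i, l_j) &= w_{ij}\, V_{l_i l_j},
    \qquad V_{\alpha\alpha} = V_{\beta\beta} = 0,
    \quad V_{\alpha\beta} = V_{\beta\alpha} \ge 0 \\
  % Binary submodularity condition
  \phi_{ij}(\alpha,\alpha) + \phi_{ij}(\beta,\beta)
    &\le \phi_{ij}(\alpha,\beta) + \phi_{ij}(\beta,\alpha) \\
  \iff \quad 0 &\le 2\, w_{ij}\, V_{\alpha\beta}
\end{align*}
```

The last inequality fails whenever w_ij < 0 and V_αβ > 0, which happens
for roughly half of the edges when w_ij is drawn symmetrically around
zero.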

3 We did not restrict the number of iterations, and let TRW-S run 
until no further improvement to the lower bound is made. 
Figure 4 Energies of Chinese characters inpainting: Box plot showing
the 25%, median and 75% of the resulting energies relative to the
reference energies of Nowozin et al (2011) (lower than 100% = lower
than baseline). Our multiscale approach combined with QPBO achieves
consistently better energies than the baseline, with very low variance.
TRW-S improves on only 25% of the instances, with very high variance
in the results.

Table 2 Energies of Chinese characters inpainting: table showing 
(a) mean energies for the inpainting experiment relative to baseline of 
Nowozin et al (2011) (lower is better, less than 100% = lower than 
baseline), (b) percent of instances for which strictly lower energy was 
achieved. 
5.2 Chinese character inpainting

We further experiment with the non-submodular learned binary
energies of (Nowozin et al 2011, §5.2) 4. These 100 instances
of non-submodular pair-wise energies are defined over a
64-connected grid. These energies were designed and trained
to perform the task of learning Chinese calligraphy, represented
as a complex, non-local binary pattern.

Our experiments show how approaching these challenging
energies using our unified multiscale framework allows
for better approximations. Table 2 and Fig. 4 compare our
multiscale framework to single-scale methods acting on the
primal binary variables. Since the energies are binary,
multi-label large move making algorithms boil down to binary
QPBO. We also provide an evaluation of a dual method (TRW-S)
on these energies. In addition to the quantitative results,
Fig. 3 provides a visualization of some of the instances of
the restored Chinese characters.

For these challenging non-submodular "real world" energies,
our multiscale framework provides a significant improvement
over the single-scale scheme.
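
The two summary statistics reported for this experiment, (a) mean energy
relative to a baseline and (b) percent of instances with strictly lower
energy, can be computed as in the sketch below. The function name is
hypothetical, and two details are assumptions rather than facts from the
text: the mean is taken over per-instance ratios, and the reference
energies are positive so that a ratio below 100% means a lower energy.

```python
import numpy as np

def relative_energy_stats(energies, reference):
    """Return (mean percent of reference, percent of instances strictly lower)."""
    energies = np.asarray(energies, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # (a) mean of per-instance ratios, expressed in percent (assumed convention)
    mean_percent = 100.0 * np.mean(energies / reference)
    # (b) share of instances whose energy is strictly below the reference
    percent_lower = 100.0 * np.mean(energies < reference)
    return mean_percent, percent_lower

# Made-up numbers, for illustration only (not results from the paper).
mean_percent, percent_lower = relative_energy_stats(
    [90.0, 100.0, 110.0, 95.0], [100.0, 100.0, 100.0, 100.0]
)
```

Reporting both numbers is useful because a mean close to 100% can hide
whether a method wins on most instances or only on a few.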

4 Available at www.nowozin.net/sebastian/papers/DTF_CIP_instances.zip.



Table 3 Co-clustering results: The baseline for comparison is the
state-of-the-art results of Glasner et al (2011). (a) We report our
results as percent of the baseline: smaller is better, and lower than
100% outperforms the state-of-the-art. (b) We also report the fraction
of energies for which our multiscale framework outperforms the
state-of-the-art.
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single 
(a)    99.9%   177.7%    99.8%   101.5%    99.8%   101.6%   176.2%
(b)    55.6%     0.0%    71.8%    15.5%    70.8%    11.6%     0.5%



5.3 Co-clustering 

The problem of co-clustering addresses the matching of
superpixels within and across frames in a video sequence.
Following (Bagon and Galun 2011, §6.2), we treat co-clustering
as a discrete minimization of a non-submodular Potts energy.
We obtained 77 co-clustering energies, courtesy of Glasner
et al (2011), used in their experiments. The number of
variables in each energy ranges from 87 to 788. Their sparsity
(percent of non-zero entries in W) ranges from 6% to
50%. The resulting energies are non-submodular, have no
underlying regular grid, and are very challenging to optimize
(Bagon and Galun (2011)).

Table 3 compares our discrete multiscale framework combined
with ICM, αβ-swap and α-expansion. For these energies
we use a different baseline: the state-of-the-art results
of Glasner et al (2011), obtained by applying a specially
tailored convex relaxation method. (We do not use the lower
bound of TRW-S here since it is far from being tight for
these challenging energies.) Our multiscale framework
improves on the state-of-the-art for this family of challenging
energies and significantly outperforms TRW-S.

Furthermore, the results demonstrated in the last three
sub-sections highlight the advantage that primal methods
have over dual ones when it comes to challenging
non-submodular energies.



5.4 Submodular energies 

We further applied our multiscale framework to optimize
less challenging submodular energies. We use the diverse
low-level vision MRF energies from the Middlebury benchmark
(Szeliski et al (2008)) 5.

For these submodular energies, TRW-S (single scale) performs
quite well and, in fact, if enough iterations are allowed,
its lower bound converges to the global optimum. As opposed
to TRW-S, large move making algorithms and ICM do not always
converge to the global optimum. Yet, we are able to show a

5 Available at vision.middlebury.edu/MRF/.
Table 4 Stereo: Showing percent of achieved energy value relative to
the lower bound (closer to 100% is better). Visual results for these 
experiments are in Fig. 5. Energies from Szeliski et al (2008). 
Table 5 Denoising and inpainting: Showing percent of achieved en-
ergy value relative to the lower bound (closer to 100% is better). Visual 
results for these experiments are in Fig. 6. Energies from Szeliski et al 
(2008). 
significant improvement for primal optimization algorithms
when used within our multiscale framework. Tables 4 and 5 
and Figs. 5 and 6 show our multiscale results for the differ- 
ent submodular energies. 



5.5 Comparing variable agreement estimation methods 

As explained in Sec. 3, the agreements between the variables
are the most crucial component in constructing an effective
multiscale scheme. In this experiment we compare our
energy-aware agreement measure (Sec. 3.1) to three methods
proposed by Kim et al (2011): "unary-diff", "min-unary-diff"
and "mean-compat". These methods estimate the agreement
based either on the unary term or on the pair-wise term,
but not both. We also compare to an energy-agnostic measure,
that is, $c_{ij} = 1 \;\; \forall (i,j) \in \mathcal{E}$; this
method underlies Felzenszwalb and Huttenlocher (2006);
Komodakis (2010).

For each energy we estimate variable agreements using
these five different approaches. These different estimations
are then used to construct five different energy-pyramids (as
described in Sec. 3.2). Better agreement estimation will
result in better exploration of the multiscale landscape of
the energy, yielding better optimization results. We use ICM
with each of the five energy-pyramids to evaluate the
influence these methods have on the resulting multiscale
performance for three representative energies.

Fig. 7 shows the percent of lower bound for the different
energies. Energy-pyramids constructed based on our agreement
estimation method consistently outperform all other
methods, and successfully balance the influence
of the unary and the pair-wise terms.



Figure 7 Comparing agreements estimation methods: Graphs
showing percent of lower bound (closer to 100% is better) for
different methods of computing variable-agreements. One bar is
cropped at 150%. Our energy-aware measure consistently outperforms
all other methods. As a reference, results of single-scale
optimization are shown on the right.

Table 6 Coarsening labels: Working coarse-to-fine in the labels
domain. We use 5 scales with a coarsening rate of ~0.7. The number
of variables is unchanged. The table shows percent of achieved energy
value relative to the lower bound (closer to 100% is better), and
running times. These results were obtained using αβ-swap for
optimizing each scale.



Energy         #labels     #labels       Ours         single
               (finest)    (coarsest)                 scale

Penguin        256         67            103.6%       111.3%
(denoising)                              128 [sec]    253 [sec]

Venus          20          4             106.0%       128.7%
(stereo)                                 100 [sec]    130 [sec]



5.6 Coarsening labels 

αβ-swap does not scale gracefully with the number of labels.
Coarsening an energy in the labels domain (i.e., same
number of variables, fewer labels) significantly improves
the performance of αβ-swap, as shown in Table 6. For
these examples, constructing the energy pyramid took only
milliseconds, due to the "closed form" formula for estimating
label correlations.

Our principled framework for coarsening labels improves
αβ-swap performance for these energies.
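
As a quick sanity check of the "~0.7" coarsening rate quoted for Table 6,
the per-scale rate implied by the finest and coarsest label counts can be
computed as below. This sketch assumes a constant (geometric) rate across
the 5 scales, i.e. 4 coarsening steps; that assumption and the function
name are ours, not stated in the text.

```python
def implied_rate(n_fine, n_coarse, n_scales):
    """Per-step coarsening rate, assuming the same rate at every step."""
    steps = n_scales - 1  # 5 scales means 4 coarsening steps
    return (n_coarse / n_fine) ** (1.0 / steps)

# Label counts taken from Table 6.
penguin_rate = implied_rate(256, 67, 5)
venus_rate = implied_rate(20, 4, 5)
print(f"Penguin: {penguin_rate:.3f}, Venus: {venus_rate:.3f}")
```

Both values come out close to 0.7 (roughly 0.72 and 0.67), consistent
with the approximate rate stated in the table caption.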



6 Conclusion 

This work presents a unified multiscale framework for dis- 
crete energy minimization that allows for efficient and di- 
rect exploration of the multiscale landscape of the energy. 
We propose two paths to expose the multiscale landscape of 
the energy: one in which coarser scales involve fewer and 
coarser variables, and another in which the coarser levels 
involve fewer labels. We also propose adaptive methods for 
Figure 5 Stereo: Note how our multiscale framework drastically improves ICM results. A visible improvement for αβ-swap can also be seen in the
middle row (Venus). Numerical results for these examples are shown in Table 4. Energies from Szeliski et al (2008).
Figure 6 Denoising and inpainting: Single scale ICM is unable to cope with inpainting: performing local steps it is unable to propagate
information far enough to fill the missing regions in the images. On the other hand, our multiscale framework allows ICM to perform large steps 
at coarse scales and successfully fill the gaps. Numerical results for these examples are shown in Table 5. Energies from Szeliski et al (2008). 



energy-aware interpolation between the scales. Our multi- 
scale framework significantly improves optimization results 
for challenging energies. 

Our framework provides the mathematical formulation 
that "bridges the gap" and relates multiscale discrete op- 
timization and algebraic multiscale methods used in PDE 
solvers (e.g., Brandt (1986)). This connection allows for meth- 
ods and practices developed for numerical solvers to be ap- 
plied in multiscale discrete optimization as well. 